Abstract



From a practitioner's perspective, current methods of obtaining
reliability coefficients for mastery tests are quite laborious. For example,
some methods demand two test administrations, while others require access to
computer facilities or involve advanced measurement and statistical
procedures. The present paper therefore provides tables from which
practitioners can read such reliability coefficients directly. The method used
to construct the tables is reviewed, and comments on the accuracy of the
tabled values are included.




Tables of Reliability Coefficients for Mastery Tests 



Methods for obtaining reliability estimates for mastery tests can be
quite laborious from a practitioner's point of view. For example, the method
proposed by Swaminathan, Hambleton, and Algina (1974) requires two
administrations of the same or parallel tests. Given examinees' scores on
both administrations and the cutoff score that distinguishes masters from
nonmasters, two different reliability indices can be computed: (1) the
agreement coefficient and (2) the kappa coefficient.

The agreement coefficient is simply the proportion of examinees
consistently classified as masters or as nonmasters on both administrations.
When the mastery-nonmastery classifications on the two administrations are
summarized as in Table 1, the agreement coefficient, designated $p_0$, is
given by

$$p_0 = p_{11} + p_{22}, \tag{1}$$

where $p_{11}$ and $p_{22}$ are the proportions of examinees classified,
respectively, as masters and nonmasters on both administrations. The upper
bound of the agreement coefficient is 1.00, which occurs if classifications on
both administrations are consistent for all examinees in the group. When the
two administrations in Table 1 involve the same or parallel tests, the lower
bound of the agreement coefficient is given by

$$p_{\text{chance}} = (p_{11} + p_{12})(p_{11} + p_{21}) + (p_{21} + p_{22})(p_{12} + p_{22}), \tag{2}$$

where $p_{\text{chance}}$ represents the expected proportion of consistent
classifications when there is no relationship between outcomes on the two test
administrations (Huynh, 1978).



Insert Table 1 about here


The aforementioned kappa coefficient, designated $\kappa$, is given by

$$\kappa = \frac{p_0 - p_{\text{chance}}}{1 - p_{\text{chance}}}, \tag{3}$$

where $p_0$ and $p_{\text{chance}}$ are obtained from (1) and (2). As such,
kappa reflects the proportion of consistent classifications beyond that
expected by chance. The upper and lower bounds of kappa are 1.00 and 0, which
occur, respectively, when there is perfect agreement and no relationship
between outcomes on the two test administrations.
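
As a rough illustration of Equations 1 through 3, the following Python sketch computes the three quantities from two sets of scores. The function name `consistency_indices`, the sample scores, and the rule that a score at or above the cutoff counts as mastery are assumptions made for this example, not part of the original method description.

```python
from typing import Sequence, Tuple


def consistency_indices(
    scores_first: Sequence[float],
    scores_second: Sequence[float],
    cutoff: float,
) -> Tuple[float, float, float]:
    """Return (p0, p_chance, kappa) for two administrations of a mastery test."""
    if not scores_first or len(scores_first) != len(scores_second):
        raise ValueError("need one score per examinee on each administration")
    n = len(scores_first)
    both_master = both_nonmaster = master_first = master_second = 0
    for x1, x2 in zip(scores_first, scores_second):
        # Assumption: a score at or above the cutoff is classified as mastery.
        m1, m2 = x1 >= cutoff, x2 >= cutoff
        master_first += m1
        master_second += m2
        both_master += m1 and m2
        both_nonmaster += (not m1) and (not m2)

    p0 = (both_master + both_nonmaster) / n           # Equation 1
    q1, q2 = master_first / n, master_second / n      # marginal mastery rates
    p_chance = q1 * q2 + (1 - q1) * (1 - q2)          # Equation 2
    if p_chance == 1:
        raise ValueError("kappa is undefined when p_chance equals 1")
    kappa = (p0 - p_chance) / (1 - p_chance)          # Equation 3
    return p0, p_chance, kappa


# Hypothetical scores for ten examinees on a test with a cutoff of 8.
first = [9, 8, 7, 5, 8, 4, 9, 6, 3, 8]
second = [8, 9, 6, 4, 7, 5, 9, 8, 2, 8]
p0, p_chance, kappa = consistency_indices(first, second, cutoff=8)
```

For these made-up scores, eight of the ten examinees are classified consistently, so the agreement coefficient is .80, while chance agreement is .50 and kappa is .60.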

Computer methods for estimating the agreement and kappa coefficients from
a single test administration have been proposed, thereby eliminating the need
for a second test administration (Huynh, 1976; Marshall & Haertel, 1976;
Subkoviak, 1976). However, these methods are also difficult for practitioners
to implement, since they require access to computer facilities and appropriate
software, and they assume a somewhat advanced background in test theory.

Approximation methods involving hand calculations of the agreement and
kappa coefficients from a single test administration have also been proposed
(Huynh, 1976, p. 258; Peng & Subkoviak, 1980, p. 363). While these methods
are the simplest thus far proposed, they still involve the use of statistical
tables of the bivariate and univariate normal distributions, which may not be
entirely familiar to or readily available to practitioners. Thus, the present
paper provides even greater simplicity: tables from which practitioners can
directly read approximate values of the agreement coefficient or the kappa
coefficient.

Tables of Agreement and Kappa Coefficients 

Table 2 contains approximate values of the agreement coefficient, and 
Table 3 contains approximate values of the kappa coefficient. 



Insert Tables 2 and 3 about here 



To use either table, two values are needed, both of which can be obtained
from the data for a single test administration: (1) the norm-referenced
reliability of the test ($r$) and (2) the cutoff score of the test expressed
as a standard score ($z$).

The norm-referenced reliability coefficient $r$ can be computed using
well-known and widely published formulae (Stanley, 1971); some of the more
common indices of this type are the Kuder-Richardson 20 and 21 coefficients,
Cronbach's alpha coefficient, and Hoyt's reliability coefficient. For
example, Kuder-Richardson formula 21 provides practitioners with a very simple
means of estimating $r$:

$$r_{KR\text{-}21} = \frac{nS^2 - M(n - M)}{(n - 1)S^2}, \tag{4}$$

where $n$ is the number of test items, $M$ is the mean of the scores, and
$S^2$ is the variance of the scores. Formula 4 is appropriate for test items
scored as right or wrong; and it generally provides underestimates of the
reliability coefficient $r$, which lead to conservative estimates of the
agreement and kappa coefficients in Tables 2 and 3. If items are not binary
scored or if less conservative estimates are desired, one of the other
formulae noted previously for estimating $r$ can be employed.
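
Equation 4 is simple enough to check by hand, but a small Python sketch may help when several tests are processed. The function name `kr21` is an assumption for this example; the input values are those of the ten-item test used in the paper's worked example.

```python
def kr21(n_items: int, mean: float, variance: float) -> float:
    """Kuder-Richardson formula 21 (Equation 4) for right/wrong scored items."""
    if n_items < 2:
        raise ValueError("KR-21 needs at least two items")
    if variance <= 0:
        raise ValueError("score variance must be positive")
    numerator = n_items * variance - mean * (n_items - mean)
    return numerator / ((n_items - 1) * variance)


# Ten items, mean 4.63, variance 3.27 (the worked example in this paper).
r = kr21(n_items=10, mean=4.63, variance=3.27)
print(round(r, 2))  # 0.27
```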

The standard score $z$, which appears in Tables 2 and 3, is obtained as
follows:

$$z = \frac{c - .5 - M}{S}, \tag{5}$$

where $c$ is the raw cutoff score of the test, $M$ is the mean of the scores,
and $S$ is the standard deviation of the scores. The value .5 in Equation 5
is a correction for continuity, which arises from the fact that Tables 2 and 3
were obtained by approximating the discrete test score distribution with the
continuous normal distribution, as discussed later. The computed value of
$z$ given by Equation 5 may be either negative or positive. However, due to
the symmetry of the approximating normal distribution, a negative $z$ value
like -.10 will lead to the same agreement or kappa coefficient as a positive
$z$ value like +.10. Thus, the unsigned or absolute value $|z|$ is sufficient
to make use of Tables 2 and 3.
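
A minimal Python sketch of Equation 5 follows. The function name `standardized_cutoff` is an assumption for illustration; the inputs are the cutoff, mean, and variance of the paper's worked example.

```python
import math


def standardized_cutoff(cutoff: float, mean: float, sd: float) -> float:
    """Equation 5: raw cutoff as a standard score, with the .5 continuity correction."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (cutoff - 0.5 - mean) / sd


# Cutoff 8, mean 4.63, variance 3.27 (so S is the square root of 3.27).
z = standardized_cutoff(cutoff=8, mean=4.63, sd=math.sqrt(3.27))
print(round(abs(z), 2))  # 1.59
```

Only the absolute value is needed for the table lookup, which is why the example prints `abs(z)`.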



An Example 

The use of Tables 2 and 3 will be illustrated employing a set of real
data, which is described in greater detail elsewhere (Subkoviak, 1980). A
10-item multiple-choice test, with a cutoff score of 8, was administered to
$N = 30$ students. The mean of the test was $M = \sum x/N = 4.63$, and the
variance was $S^2 = [\sum x^2/(N - 1)] - [(\sum x)^2/N(N - 1)] = 3.27$. Using
Equation 4, the reliability of the test was
$r_{KR\text{-}21} = [nS^2 - M(n - M)]/[(n - 1)S^2] =
[(10)(3.27) - (4.63)(10 - 4.63)]/[(10 - 1)(3.27)] = .27$, or
$r_{KR\text{-}21} = .30$ approximately. Using Equation 5, the $z$ value was
$z = (c - .5 - M)/S = (8 - .5 - 4.63)/\sqrt{3.27} = 1.59$, or $z = 1.60$
approximately. Entering Table 2 with $r = .30$ and $|z| = 1.60$, it can be
seen that the coefficient of agreement is $p_0 = .91$, approximately.
Similarly, the kappa coefficient provided in Table 3 is $\kappa = .10$,
approximately. The values of the agreement and kappa coefficients are quite
different ($p_0 = .91$ vs. $\kappa = .10$) because the two coefficients are
distinct measures of consistency, a point discussed in greater detail in the
next section of the paper. Since $r = .27$ and $|z| = 1.59$ in the example,
somewhat more precise estimates of $p_0$ and $\kappa$ could be obtained from
Tables 2 and 3 by interpolation (Subkoviak, 1980, pp. 141-142); but for
practical purposes, the slight gain in precision may not be worth the
additional effort.
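
The rounding step in the example (from .27 to .30 and from 1.59 to 1.60) can be sketched in Python as follows. The function name `table_coordinates` is hypothetical, and clamping values that fall outside the tabled ranges is an assumption of this sketch rather than something the paper prescribes.

```python
from typing import Tuple


def table_coordinates(r: float, z: float) -> Tuple[float, float]:
    """Round r and |z| to the nearest tabled values (steps of .10)."""
    # Note: Python's round() sends exact halves to the nearest even integer.
    r_grid = round(r * 10) / 10
    z_grid = round(abs(z) * 10) / 10
    # Assumption: values outside the tables are clamped to the nearest edge
    # (r from .10 to .90, |z| from .00 to 2.00).
    r_grid = min(max(r_grid, 0.1), 0.9)
    z_grid = min(z_grid, 2.0)
    return r_grid, z_grid


print(table_coordinates(0.27, 1.59))   # (0.3, 1.6)
print(table_coordinates(0.27, -1.59))  # same row, since only |z| matters
```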

Tables 2 and 3 can also be used to determine the agreement and kappa
coefficients of a test that has been lengthened or shortened by a factor of
$\ell$. Suppose one wished to determine in the previous example what the
agreement and kappa coefficients would be if 5 more items, equivalent to the
original 10, were added to the test. Since the lengthened test of 15 items is
1.5 times the original length, $\ell$ would equal 1.5. The mean of the
lengthened test would be $M_\ell = \ell M = (1.5)(4.63) = 6.95$; the cutoff of
the lengthened test would be $c_\ell = \ell c = (1.5)(8) = 12$; and the
variance of the lengthened test would be
$S_\ell^2 = \ell S^2[1 + (\ell - 1)r] = (1.5)(3.27)[1 + (1.5 - 1)(.27)] = 5.57$
(Lord & Novick, 1968, p. 86). Substituting these values into Equation 5
produces the result
$z_\ell = (c_\ell - .5 - M_\ell)/S_\ell = (12 - .5 - 6.95)/\sqrt{5.57} = 1.93$,
or $z_\ell = 1.90$ approximately. Finally, the reliability of the lengthened
test would be
$r_\ell = \ell r/[1 + (\ell - 1)r] = (1.5)(.27)/[1 + (1.5 - 1)(.27)] = .36$, or
$r_\ell = .40$ approximately. Entering Tables 2 and 3 with $r_\ell = .40$ and
$|z_\ell| = 1.90$, the agreement and kappa coefficients of the lengthened test
are $p_0 = .95$ and $\kappa = .12$, approximately, which are slightly larger
than the original values ($p_0 = .91$ and $\kappa = .10$), since lengthening a
test increases its reliability.
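
The lengthening calculations can be collected into one short Python sketch. The names `LengthenedTest` and `lengthen` are assumptions for illustration; the formulas are the ones quoted in the paragraph above, with the reliability adjustment being the familiar Spearman-Brown formula.

```python
import math
from typing import NamedTuple


class LengthenedTest(NamedTuple):
    mean: float
    cutoff: float
    variance: float
    reliability: float
    z: float


def lengthen(mean: float, cutoff: float, variance: float,
             reliability: float, factor: float) -> LengthenedTest:
    """Project test statistics to a test `factor` times the original length."""
    if factor <= 0:
        raise ValueError("length factor must be positive")
    growth = 1 + (factor - 1) * reliability
    new_mean = factor * mean
    new_cutoff = factor * cutoff
    new_variance = factor * variance * growth
    new_reliability = factor * reliability / growth  # Spearman-Brown
    # Equation 5 applied to the lengthened test.
    new_z = (new_cutoff - 0.5 - new_mean) / math.sqrt(new_variance)
    return LengthenedTest(new_mean, new_cutoff, new_variance, new_reliability, new_z)


# The paper's example: 10 items lengthened to 15, so the factor is 1.5.
result = lengthen(mean=4.63, cutoff=8, variance=3.27, reliability=0.27, factor=1.5)
print(round(result.reliability, 2), round(result.z, 2))  # 0.36 1.93
```

A factor below 1 describes a shortened test in the same way.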

Discussion 

It may be noted that corresponding entries in Tables 2 and 3 are
generally quite different, as in the example where $p_0 = .91$ and
$\kappa = .10$. Such differences are due to the fact that the agreement
coefficient and the kappa coefficient are distinct measures of consistency
(see Subkoviak, 1980, pp. 152-154). The agreement coefficient is the total
proportion of consistent classifications on two test administrations, whereas
the kappa coefficient reflects the proportion of consistent classifications
beyond that expected by chance. In concrete terms, this means that the kappa
coefficient is more sensitive than the agreement coefficient to changes in
test reliability $r$, as can be seen by comparing corresponding rows of
Tables 2 and 3; and kappa is less sensitive to changes in $|z|$, which reflect
the location of the cutoff within the distribution of scores, as can be seen
by comparing corresponding columns of Tables 2 and 3. An awareness of these
differences is important when interpreting and reporting values of the two
coefficients.

The question of what is an acceptable value of an agreement or kappa
coefficient naturally arises when interpreting and reporting obtained values
of these indices. Consider the coefficient of agreement ($p_0$), which can be
thought of as the probability that a randomly selected examinee will be
consistently classified on two test replications. The question of how large
this probability value should be depends upon the seriousness of the decisions
being made with the test results. If the test is being used to decide who
will graduate and who will not, this probability should be quite large,
perhaps .95, as might occur in Table 2 with a published test having
reliability $r = .90$ and standardized cutoff $z = -1.50$ (a standard that
implies about 7% of a normally distributed group score below the cutoff). On
the other hand, if the test results are being used to make routine classroom
decisions like who will move on to the next unit of instruction and who will
remain on the present unit, the probability can be somewhat lower, perhaps
.85, as might occur in Table 2 with a teacher-made test having reliability
$r = .70$ and standardized cutoff $z = -1.00$ (implying about 16% of a normally
distributed class score below the cutoff).
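
The proportion of a normally distributed group that falls below a standardized cutoff can be checked with the standard normal distribution function. The sketch below uses only Python's standard library; the function name `proportion_below` is an assumption for this example.

```python
import math


def proportion_below(z: float) -> float:
    """Standard normal probability of scoring below the standardized cutoff z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for z in (-1.50, -1.00):
    print(f"z = {z:+.2f}: {proportion_below(z):.3f} of the group below the cutoff")
```

For the two cutoffs discussed above, the function returns roughly .067 and .159.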

The question of what constitutes an acceptable value of a kappa
coefficient can best be answered by first reviewing what the coefficient
measures. The formula for kappa (3) involves the values $p_{\text{chance}}$,
$p_0$, and 1.0, which represent, respectively, the probability of consistent
classification when no relationship exists, when the observed relationship
exists, and when a perfect relationship exists between the outcomes on two
test administrations. Therefore, the numerator of kappa
($p_0 - p_{\text{chance}}$) reflects the gain in consistency between the no
relationship condition ($p_{\text{chance}}$) and the observed relationship
($p_0$); and the denominator ($1 - p_{\text{chance}}$) reflects the maximum
gain in consistency possible between the no relationship condition
($p_{\text{chance}}$) and the perfect relationship condition (1.0). Thus,
$\kappa = (p_0 - p_{\text{chance}})/(1 - p_{\text{chance}})$ is a ratio of the
actual gain in consistency due to the test to the maximum gain possible; or in
other words, kappa reports the actual contribution the test makes to
consistency as a proportion of the maximum possible contribution that could be
made. Kappa, therefore, is a measure of the extent to which a test is
performing up to the maximum possible limit; and one normally expects more, in
this sense, of published tests than teacher-made tests. For example, published
tests should probably be expected to have kappa values of .50 or greater, as
would occur in Table 3 for a published test having reliability $r = .90$ and
standardized cutoff $|z|$ between .00 and 2.00. On the other hand,
teacher-made tests might be expected to have kappa values of .25 or greater,
as would occur in Table 3 for a classroom test having reliability $r = .70$
and standardized cutoff $|z|$ between .00 and 2.00. However, notice that a
test may not be living up to expectation in terms of the kappa coefficient,
and yet the overall probability of consistent classification in terms of the
agreement coefficient may still be acceptable. For example, if a published
test has reliability $r = .80$ and standardized cutoff $|z| = 2.00$, the
associated kappa value in Table 3 is $\kappa = .42$, which is below the .50
benchmark previously suggested for a published test; yet the associated
agreement coefficient in Table 2 is $p_0 = .97$, which is above the .95
benchmark previously suggested for tests used to make important decisions.
This is one more illustration of the fact that the agreement and kappa
coefficients are distinct measures of consistency, requiring individual
interpretation.


Construction of the Tables 

Tables 2 and 3 were constructed using a procedure proposed by Peng and
Subkoviak (1980, p. 363) for estimating the agreement or kappa coefficient.
This procedure assumes that if two test administrations were actually
conducted, the joint distribution of scores on the two testings could be
approximated by a bivariate normal distribution. Under this assumption, the
agreement and kappa coefficients are, respectively, given by

$$p_0 = 1 + 2(p_{zz} - p_z) \tag{6}$$

and

$$\kappa = \frac{p_{zz} - p_z^2}{p_z - p_z^2}, \tag{7}$$

where $z$ is the cutoff expressed as a standard score (see Equation 5);
$p_z$ is the probability that a standard normal variable is less than the
value $z$; and $p_{zz}$ is the probability that two standard normal variables,
having correlation $r$ (see Equation 4), are both less than $z$.

Tables 2 and 3 were obtained from Equations 6 and 7 by first specifying
values for $z$ and $r$ and by then determining the corresponding values of
$p_z$ and $p_{zz}$ in (6) and (7), which can be obtained by computer routines
or from tables of the univariate and bivariate normal distributions. For
example, the first entry in Tables 2 and 3 was obtained by specifying the
values $z = .00$, $r = .10$ and determining the corresponding probabilities
$p_z = .5000$, $p_{zz} = .2659$ from the standard univariate and bivariate
normal distributions. Substituting these values into Equation 6 provides the
agreement coefficient
$p_0 = 1 + 2(p_{zz} - p_z) = 1 + 2(.2659 - .5000) = .5318$, or $p_0 = .53$,
approximately; and Equation 7 provides the kappa coefficient
$\kappa = (p_{zz} - p_z^2)/(p_z - p_z^2) = (.2659 - .5000^2)/(.5000 - .5000^2) = .0636$,
or .06, approximately. All other entries in Tables 2 and 3 were obtained in
the same way.
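
For readers with access to a computer, the sketch below shows one way the entries could be recomputed in Python using only the standard library. It is not the routine used to build the tables: the function names are assumptions, and the bivariate probability is obtained here by a substituted technique, namely integrating the bivariate normal density over the correlation from 0 to $r$ (Plackett's identity) with Simpson's rule.

```python
import math


def normal_cdf(z: float) -> float:
    """p_z: probability that a standard normal variable is less than z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bivariate_cdf_equal(z: float, r: float, steps: int = 200) -> float:
    """p_zz: probability that both of two standard normals with correlation r
    are less than z. Uses P(rho=r) = P(rho=0) + integral of the density at
    (z, z) over rho from 0 to r, evaluated by composite Simpson's rule."""
    if not -1.0 < r < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals

    def density(rho: float) -> float:
        return math.exp(-z * z / (1.0 + rho)) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))

    h = r / steps
    total = density(0.0) + density(r)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return normal_cdf(z) ** 2 + total * h / 3.0


def agreement_and_kappa(z: float, r: float) -> tuple:
    """Return (p0, kappa) from Equations 6 and 7."""
    p_z = normal_cdf(z)
    p_zz = bivariate_cdf_equal(z, r)
    p0 = 1.0 + 2.0 * (p_zz - p_z)                    # Equation 6
    kappa = (p_zz - p_z ** 2) / (p_z - p_z ** 2)     # Equation 7
    return p0, kappa


p0, kappa = agreement_and_kappa(z=0.0, r=0.10)
print(round(p0, 2), round(kappa, 2))  # 0.53 0.06
```

The printed values match the first entries of Tables 2 and 3. Small differences in the third or fourth decimal place relative to the worked example are expected, because the example uses probabilities rounded to four places.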

Peng and Subkoviak (1980) found that Equations 6 and 7, on which Tables
2 and 3 are based, generally provide good approximations even when the test
data are not normally distributed. They simulated nonnormal data for 125
different conditions, including U-shaped, uniform, platykurtic, leptokurtic,
and skewed test score distributions, and they then compared the exact
agreement and kappa coefficients for the data to approximations of these two
coefficients given by Equations 6 and 7. The average discrepancy between
exact and approximate values over all 125 conditions was .013 for the
agreement coefficient and .037 for the kappa coefficient. As would be
expected, the greatest discrepancies occurred for the most nonnormal
distributions of test scores, which were U-shaped; the average discrepancy
over 25 such cases was .019 for the agreement coefficient and .043 for the
kappa coefficient. As the simulated test score distributions became more
nearly normal, discrepancies between exact and approximate values decreased;
for example, the average discrepancy over 25 leptokurtic cases was .008 for
the agreement coefficient and .036 for the kappa coefficient. Thus, it appears
that Tables 2 and 3 should generally provide practitioners with useful
approximations of the agreement and kappa coefficients over a variety of
realistic data conditions and with minimal effort.

For purposes of completeness, it might be noted that tables of agreement
and kappa coefficients have also been produced by Huynh for short tests
containing between five and ten items; and these tables are reproduced in
Subkoviak (1980). However, the Huynh tables are based on the assumption that
the test data follow a beta-binomial distribution rather than a normal
distribution, as assumed in Tables 2 and 3.

Table 1

Classification of Examinees on Two Test Administrations

                                Admin 2
Admin 1        Master              Nonmaster

Master         p_11                p_12                (p_11 + p_12)
Nonmaster      p_21                p_22                (p_21 + p_22)

               (p_11 + p_21)       (p_12 + p_22)

Table 2

Approximate Values of the Agreement Coefficient p_0

                                  r
 |z|     .10   .20   .30   .40   .50   .60   .70   .80   .90

  .00    .53   .56   .60   .63   .67   .70   .75   .80   .86
  .10    .53   .57   .60   .63   .67   .71   .75   .80   .86
  .20    .54   .57   .61   .64   .67   .71   .75   .80   .86
  .30    .56   .59   .62   .65   .68   .72   .76   .80   .86
  .40    .58   .60   .63   .66   .69   .73   .77   .81   .87
  .50    .60   .62   .65   .68   .71   .74   .78   .82   .87
  .60    .62   .65   .67   .70   .73   .76   .79   .83   .88
  .70    .65   .67   .70   .72   .75   .77   .80   .84   .89
  .80    .68   .70   .72   .74   .77   .79   .82   .85   .90
  .90    .71   .73   .75   .77   .79   .81   .84   .87   .90
 1.00    .75   .76   .77   .77   .81   .83   .85   .88   .91
 1.10    .78   .79   .80   .81   .83   .85   .87   .89   .92
 1.20    .80   .81   .82   .84   .85   .86   .88   .90   .93
 1.30    .83   .84   .85   .86   .87   .88   .90   .91   .94
 1.40    .86   .86   .87   .88   .89   .90   .91   .93   .95
 1.50    .88   .88   .89   .90   .90   .91   .92   .94   .95
 1.60    .90   .90   .91   .91   .92   .93   .93   .95   .96
 1.70    .92   .92   .92   .93   .93   .94   .95   .95   .97
 1.80    .93   .93   .94   .94   .94   .95   .95   .96   .97
 1.90    .95   .95   .95   .95   .95   .96   .96   .97   .98
 2.00    .96   .96   .96   .96   .96   .97   .97   .97   .98

Table 3

Approximate Values of the Kappa Coefficient κ

                                  r
 |z|     .10   .20   .30   .40   .50   .60   .70   .80   .90

  .00    .06   .13   .19   .26   .33   .41   .49   .59   .71
  .10    .06   .13   .19   .26   .33   .41   .49   .59   .71
  .20    .06   .13   .19   .26   .33   .41   .49   .59   .71
  .30    .06   .12   .19   .26   .33   .40   .49   .59   .71
  .40    .06   .12   .19   .25   .32   .40   .48   .58   .71
  .50    .06   .12   .18   .25   .32   .40   .48   .58   .70
  .60    .06   .12   .18   .24   .31   .39   .47   .57   .70
  .70    .05   .11   .17   .24   .31   .38   .47   .57   .70
  .80    .05   .11   .17   .23   .30   .37   .46   .56   .69
  .90    .05   .10   .16   .22   .29   .36   .45   .55   .68
 1.00    .05   .10   .15   .21   .28   .35   .44   .54   .68
 1.10    .04   .09   .14   .20   .27   .34   .43   .53   .67
 1.20    .04   .08   .14   .19   .26   .33   .42   .52   .66
 1.30    .04   .08   .13   .18   .25   .32   .41   .51   .65
 1.40    .03   .07   .12   .17   .23   .31   .39   .50   .64
 1.50    .03   .07   .11   .16   .22   .29   .38   .49   .63
 1.60    .03   .06   .10   .15   .21   .28   .37   .47   .62
 1.70    .02   .05   .09   .14   .20   .27   .35   .46   .61
 1.80    .02   .05   .08   .13   .18   .25   .34   .45   .60
 1.90    .02   .04   .08   .12   .17   .24   .32   .43   .59
 2.00    .02   .04   .07   .11   .16   .22   .31   .42   .58